Learning Lexical Embeddings with Syntactic and Lexicographic Knowledge

Authors

  • Tong Wang
  • Abdel-rahman Mohamed
  • Graeme Hirst
Abstract

We propose two improvements on lexical association used in embedding learning: factorizing individual dependency relations and using lexicographic knowledge from monolingual dictionaries. Both proposals provide low-entropy lexical co-occurrence information, and are empirically shown to improve embedding learning by performing notably better than several popular embedding models in similarity tasks.

1 Lexical Embeddings and Relatedness

Lexical embeddings are essentially real-valued distributed representations of words. As a vector-space model, an embedding model approximates semantic relatedness with the Euclidean distance between embeddings, the result of which helps better estimate the real lexical distribution in various NLP tasks. In recent years, researchers have developed efficient and effective algorithms for learning embeddings (Mikolov et al., 2013a; Pennington et al., 2014) and extended model applications from language modelling to various areas in NLP, including lexical semantics (Mikolov et al., 2013b) and parsing (Bansal et al., 2014).

To approximate semantic relatedness with geometric distance, objective functions are usually chosen to correlate positively with the Euclidean similarity between the embeddings of related words. Maximizing such an objective function is then equivalent to adjusting the embeddings so that those of related words become geometrically closer.

The definition of relatedness among words can have a profound influence on the quality of the resulting embedding models. In most existing studies, relatedness is defined by co-occurrence within a window frame sliding over texts. Although supported by the distributional hypothesis (Harris, 1954), this definition suffers from two major limitations. Firstly, the window frame is usually rather small (for efficiency and sparsity considerations), which increases the false-negative rate by missing long-distance dependencies. Secondly, a window frame can (and often does) span across different constituents in a sentence, increasing the false-positive rate by associating unrelated words. This problem worsens as the window size increases, since each false-positive n-gram will appear in two subsuming false-positive (n+1)-grams.

Several existing studies have addressed these limitations of window-based contexts. Nonetheless, we hypothesize that lexical embedding learning can further benefit from (1) factorizing syntactic relations into individual relations for structured syntactic information and (2) defining relatedness using lexicographic knowledge. We will show that implementing these ideas brings notable improvement in lexical similarity tasks.
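To make the two proposals concrete, the sketch below (a hypothetical illustration, not the implementation described in the paper) contrasts the (target, context) pairs produced by a sliding window with pairs produced from a dependency parse in which each relation type is kept as its own context, and adds headword/defining-word pairs from a toy dictionary entry as a stand-in for lexicographic knowledge. The toy sentence, the hand-written arcs, the `rel:word` context encoding, and the dictionary entry are all assumptions made for illustration only.

```python
# A minimal, self-contained sketch (not the authors' code) of the two sources of
# low-entropy co-occurrence discussed above: (1) dependency contexts factorized by
# relation type, contrasted with plain window contexts, and (2) headword/defining-word
# pairs taken from a monolingual dictionary.

from typing import Dict, List, Tuple

# Toy sentence: "the hungry dog quickly eats fresh meat"
tokens = ["the", "hungry", "dog", "quickly", "eats", "fresh", "meat"]

# Hand-annotated arcs as (dependent_index, relation, head_index); in practice these
# would come from an automatic dependency parser run over a large corpus.
arcs = [
    (0, "det", 2),     # the     <-det-    dog
    (1, "amod", 2),    # hungry  <-amod-   dog
    (2, "nsubj", 4),   # dog     <-nsubj-  eats
    (3, "advmod", 4),  # quickly <-advmod- eats
    (5, "amod", 6),    # fresh   <-amod-   meat
    (6, "dobj", 4),    # meat    <-dobj-   eats
]


def window_contexts(tokens: List[str], size: int = 2) -> List[Tuple[str, str]]:
    """Unstructured (target, context) pairs from a sliding window of +/- `size`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - size), min(len(tokens), i + size + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def factorized_contexts(tokens: List[str],
                        arcs: List[Tuple[int, str, int]]) -> List[Tuple[str, str]]:
    """(target, relation-typed context) pairs: each dependency relation is kept as
    its own context type, so a model can learn one association per relation rather
    than pooling all syntactic neighbours into a single bag."""
    pairs = []
    for dep, rel, head in arcs:
        pairs.append((tokens[head], f"{rel}:{tokens[dep]}"))      # head -> dependent
        pairs.append((tokens[dep], f"{rel}_inv:{tokens[head]}"))  # dependent -> head
    return pairs


def definition_contexts(dictionary: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """(headword, defining word) pairs from a toy monolingual dictionary entry."""
    return [(head, w) for head, definition in dictionary.items() for w in definition]


if __name__ == "__main__":
    # Window pairs include accidental neighbours such as ("quickly", "dog"),
    # while factorized pairs are confined to actual syntactic dependencies.
    print(window_contexts(tokens))
    print(factorized_contexts(tokens, arcs))
    print(definition_contexts({"dog": ["domesticated", "carnivorous", "mammal"]}))
```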

Similar resources

L2 Learners’ Lexical Inferencing: Perceptual Learning Style Preferences, Strategy Use, Density of Text, and Parts of Speech as Possible Predictors

This study was intended first to categorize the L2 learners in terms of their learning style preferences and second to investigate if their learning preferences are related to lexical inferencing. Moreover, strategies used for lexical inferencing and text related issues of text density and parts of speech were studied to determine their moderating effects and the best predictors of lexical infe...

Exploiting Linguistic Knowledge in Lexical and Compositional Semantic Models

A thesis submitted by Tong Wang in conformity with the requirements for the degree of Doctor of Philosophy, Graduate Department of Computer Science, University of Toronto, 2016. A fundamental principle in distributional semantic models is to use similarity in linguistic environment as a proxy for similarity in meaning. Known as the distributional hypothesis, the principle has been successfully app...

Lexical Profiling for Arabic

We provide lexical profiling for Arabic by covering two important linguistic aspects of Arabic lexical information, namely morphological inflectional paradigms and syntactic subcategorization frames, making our database a rich repository of Arabic lexicographic details. First, we provide a complete description of the inflectional behaviour of Arabic lemmas based on statistical distribution. We ...

Enriching Word Embeddings Using Knowledge Graph for Semantic Tagging in Conversational Dialog Systems

Unsupervised word embeddings provide rich linguistic and conceptual information about words. However, they may provide weak information about domain-specific semantic relations for certain tasks such as semantic parsing of natural language queries, where such information about words can be valuable. To encode the prior knowledge about the semantic word relations, we present a new method as follow...

The Relationship between Syntactic and Lexical Complexity in Speech Monologues of EFL Learners

This study aims to explore the relationship between syntactic and lexical complexity and also the relationship between different aspects of lexical complexity. To this end, speech monologs of 35 Iranian high-intermediate learners of English on three different tasks (i.e. argumentation, description, and narration) were analyzed for correlations between one measure of sy...

Journal title:

Volume   Issue

Pages  -

Publication date: 2015